QTM 447 Lecture 11: Vanishing Gradients and Generalization

Kevin McAlister

February 18, 2025

Backpropagation

\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]

Backpropagation is the main workhorse for training DNNs:

  • Construct the computational graph for the NN

  • Create initial values for the unknowns

  • Compute the intermediate values as a function of the initial values

  • Compute the gradient using backprop

  • Update the unknowns using SGD.

  • Repeat a bunch of times
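As a concrete (if simplified) illustration of the recipe above, here is the same loop for a model with no hidden layers — plain linear regression with MSE loss — where the gradient can be written by hand. All sizes, the learning rate, and the iteration count are illustrative choices:

```python
import numpy as np

# A minimal sketch of the training recipe: forward pass, gradient, SGD update.
rng = np.random.default_rng(0)
N, P = 200, 3
X = rng.normal(size=(N, P))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

beta = rng.normal(size=P)                      # initial values for the unknowns
lr, n_iter, batch_size = 0.1, 500, 32
for t in range(n_iter):                        # repeat a bunch of times
    idx = rng.choice(N, batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    z = Xb @ beta                              # forward pass (intermediates)
    grad = -2 / batch_size * Xb.T @ (yb - z)   # gradient of the MSE loss
    beta = beta - lr * grad                    # SGD update

print(np.round(beta, 1))
```

For a real DNN the only step that changes substantively is the gradient computation, which is what backpropagation automates.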

Backpropagation

For DNNs, there are patterns in the forward and backward passes that allow us to succinctly write code to compute the complicated gradients!

Quick review:

  • \(\mathbf y\) is a \(N\)-vector of continuous outcomes

  • \(\mathbf X\) is a \(N \times P\) matrix of feature values

  • Loss is MSE

\[\mathcal L(\boldsymbol \theta | \mathbf X, \mathbf y) = \expn (y_i - z_i)^2 = \frac{1}{N} (\mathbf y - \mathbf z)^T(\mathbf y - \mathbf z)\]
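As a quick sanity check, the two forms of the MSE loss above — the elementwise average and the vectorized inner product — can be verified to agree numerically:

```python
import numpy as np

# Check that the elementwise and vectorized forms of the MSE loss agree.
rng = np.random.default_rng(1)
N = 50
y = rng.normal(size=N)
z = rng.normal(size=N)

loss_sum = (1 / N) * sum((y[i] - z[i]) ** 2 for i in range(N))
loss_vec = (1 / N) * (y - z).T @ (y - z)

print(np.isclose(loss_sum, loss_vec))  # True
```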

Backpropagation

2 layer FCNN

  • Loss layer weights (a \(K_2\) vector)

  • 2nd layer weights (a \(K_1 \times K_2\) matrix)

  • 1st layer weights (a \(P \times K_1\) matrix)

Backpropagation

Loss layer (for a single instance):

\[\frac{\partial \mathcal L}{\partial \mathbf z} = \underset{(1 \times 1)}{-2(y - z)}\]

Backpropagation

Output layer (for one instance):

\[\underset{(1 \times 1)}{z} = \underset{(1 \times K_2)}{\bb^T}\underset{(K_2 \times 1)}{\mathbf h_2} + \underset{(1 \times 1)}{a}\]

\[\frac{\partial z}{\partial a} = \underset{(1 \times 1)}{1}\]

\[\frac{\partial z}{\partial \bb} = \underset{(1 \times K_2)}{\mathbf h_2}\]

\[\frac{\partial z}{\partial \mathbf h_2} = \underset{(1 \times K_2)}{\bb^T}\]

Backpropagation

Loss gradients:

\[\underset{(1 \times 1)}{z} = \underset{(1 \times K_2)}{\bb^T}\underset{(K_2 \times 1)}{\mathbf h_2} + \underset{(1 \times 1)}{a}\]

\[\frac{\partial \mathcal L_i}{\partial a} = \underset{(1 \times 1)}{-2(y - z)}\]

\[\frac{\partial \mathcal L_i}{\partial \bb} = \underset{(1 \times 1)}{-2(y - z)} \otimes \underset{(1 \times K_2)}{\mathbf h_2}\]

\[\frac{\partial \mathcal L_i}{\partial \mathbf h_2} = \underset{(1 \times 1)}{-2(y - z)} \otimes \underset{(1 \times K_2)}{\bb^T}\]

Backpropagation

Hidden Layer 2 (for one instance):

\[\underset{(K_2 \times 1)}{\mathbf h_2} = \varphi(\mathbf q_2)\]

\[\frac{\partial \mathbf h_2}{\partial \mathbf q_2} = I(\mathbf q_2 > 0)\mathcal I_{K_2}\]

\[\underset{(K_2 \times 1)}{\mathbf q_2} = \underset{(K_2 \times K_1)}{\mathbf W_2^T} \underset{(K_1 \times 1)}{\mathbf h_1} + \underset{(K_2 \times 1)}{\mathbf b_2}\]

\[\frac{\partial \mathbf q_2}{\partial \mathbf b_2} = \underset{(K_2 \times K_2)}{\mathcal I_{K_2}}\]

\[\frac{\partial \mathbf q_2}{\partial \mathbf W_2} = \text{Complex}\]

\[\frac{\partial \mathbf q_2}{\partial \mathbf h_1} = \mathbf W_2^T\]

Backpropagation

Loss gradients:

\[\frac{\partial \mathcal L_i}{\partial \mathbf b_2} = \underset{(1 \times 1)}{-2(y - z)} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_2 > 0)}\]

\[\frac{\partial \mathcal L_i}{\partial \mathbf h_1} = \underset{(1 \times 1)}{-2(y - z)} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_2 > 0)} \otimes \underset{(K_2 \times K_1)}{\mathbf W_2^T}\]

\[\frac{\partial \mathcal L_i}{\partial \mathbf W_2^T} = \underset{(K_2 \times 1)}{\mathbf u_2^T} \underset{(1 \times K_1)}{\mathbf h_1^T}\]

where \(\mathbf u_2\) is the upstream gradient for the second hidden layer:

\[\mathbf u_2 = \frac{\partial \mathcal L_i}{\partial \mathbf q_2} = \underset{(1 \times 1)}{-2(y - z)} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_2 > 0)}\]

Backpropagation

Hidden Layer 1 (for one instance):

\[\underset{(K_1 \times 1)}{\mathbf h_1} = \varphi(\mathbf q_1)\]

\[\frac{\partial \mathbf h_1}{\partial \mathbf q_1} = I(\mathbf q_1 > 0)\mathcal I_{K_1}\]

\[\underset{(K_1 \times 1)}{\mathbf q_1} = \underset{(K_1 \times P)}{\mathbf W_1^T} \underset{(P \times 1)}{\mathbf x} + \underset{(K_1 \times 1)}{\mathbf b_1}\]

\[\frac{\partial \mathbf q_1}{\partial \mathbf b_1} = \underset{(K_1 \times K_1)}{\mathcal I_{K_1}}\]

\[\frac{\partial \mathbf q_1}{\partial \mathbf W_1} = \text{Complex}\]

\[\frac{\partial \mathbf q_1}{\partial \mathbf x} = \mathbf W_1^T\]

Backpropagation

Loss gradients:

\[\frac{\partial \mathcal L_i}{\partial \mathbf b_1} = \underset{(1 \times 1)}{-2(y - z)} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_2 > 0)} \otimes \underset{(K_2 \times K_1)}{\mathbf W_2^T} \odot \underset{(1 \times K_1)}{I(\mathbf q_1 > 0)}\]

\[\frac{\partial \mathcal L_i}{\partial \mathbf x} = \underset{(1 \times 1)}{-2(y - z)} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_2 > 0)} \otimes \underset{(K_2 \times K_1)}{\mathbf W_2^T} \odot \underset{(1 \times K_1)}{I(\mathbf q_1 > 0)} \otimes \underset{(K_1 \times P)}{\mathbf W_1^T}\]

\[\frac{\partial \mathcal L_i}{\partial \mathbf W_1^T} = \underset{(K_1 \times 1)}{\mathbf u_1^T} \underset{(1 \times P)}{\mathbf x^T}\]

where \(\mathbf u_1\) is the upstream gradient for the first hidden layer:

\[\mathbf u_1 = \frac{\partial \mathcal L_i}{\partial \mathbf q_1} = \underset{(1 \times 1)}{-2(y - z)} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_2 > 0)} \otimes \underset{(K_2 \times K_1)}{\mathbf W_2^T} \odot \underset{(1 \times K_1)}{I(\mathbf q_1 > 0)}\]
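The whole single-instance backward pass above can be sketched in numpy, checking each gradient's shape against the annotations in the slides and spot-checking one weight gradient by finite differences. All layer sizes here are arbitrary:

```python
import numpy as np

# Manual backward pass for the 2-layer FCNN (one instance, MSE loss).
rng = np.random.default_rng(2)
P, K1, K2 = 4, 3, 2
x = rng.normal(size=(P, 1))
W1, b1 = rng.normal(size=(P, K1)), rng.normal(size=(K1, 1))
W2, b2 = rng.normal(size=(K1, K2)), rng.normal(size=(K2, 1))
beta, a = rng.normal(size=(K2, 1)), 0.5
y = 1.0

relu = lambda v: np.maximum(v, 0)

# Forward pass
q1 = W1.T @ x + b1                    # (K1, 1)
h1 = relu(q1)
q2 = W2.T @ h1 + b2                   # (K2, 1)
h2 = relu(q2)
z = float(beta.T @ h2 + a)            # scalar prediction

# Backward pass: upstream gradients as row vectors
dL_dz = -2 * (y - z)
u2 = dL_dz * beta.T * (q2 > 0).T      # (1, K2) = dL/dq2
u1 = (u2 @ W2.T) * (q1 > 0).T         # (1, K1) = dL/dq1

dL_dW2T = u2.T @ h1.T                 # (K2, K1), matches shape of W2^T
dL_dW1T = u1.T @ x.T                  # (K1, P),  matches shape of W1^T

# Spot-check one entry of dL/dW2^T by finite differences
def loss(W2_):
    z_ = float(beta.T @ relu(W2_.T @ h1 + b2) + a)
    return (y - z_) ** 2

eps = 1e-6
W2p = W2.copy()
W2p[0, 0] += eps
num = (loss(W2p) - loss(W2)) / eps
print(np.isclose(num, dL_dW2T[0, 0], atol=1e-4))
```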

Backpropagation

For a generic linear layer, we always have the same gradient form!

Gradient Problems

The gradient at any layer is dependent on upstream layers through \(\mathbf u\):

\[\mathbf u = \frac{\partial \mathcal L}{\partial z} \otimes \bb^T \odot \varphi'(\mathbf q_D) \otimes \mathbf W_D^T \odot \varphi'(\mathbf q_{D - 1}) \otimes \mathbf W_{D - 1}^T \odot \varphi'(\mathbf q_{D - 2}) \otimes \mathbf W_{D - 2}^T \cdots\]

where \(\varphi'()\) is the derivative of the activation function
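We can watch this product chain misbehave numerically: with weight matrices scaled slightly below or above "unit gain," the norm of \(\mathbf u\) collapses toward zero or blows up as depth grows. Width, depth, and the scale factors below are arbitrary choices:

```python
import numpy as np

# Multiply a long chain of weight matrices and watch the norm of u
# vanish or explode.
rng = np.random.default_rng(3)
K, depth = 50, 30

def chain_norm(scale):
    u = np.ones((1, K))
    for _ in range(depth):
        # W has roughly "unit gain" before scaling (entries ~ N(0, 1/K))
        W = scale * rng.normal(size=(K, K)) / np.sqrt(K)
        u = u @ W
    return float(np.linalg.norm(u))

small = chain_norm(0.5)   # shrinks by ~0.5 per layer: vanishing gradient
big = chain_norm(2.0)     # grows by ~2 per layer: exploding gradient
print(small, big)
```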

Gradient Problems

As discussed last class, it is possible for this product chain to explode or vanish

  • Multiplying a lot of large values together approaches \(\infty\)

  • Multiplying a lot of small values (close to zero) approaches 0

These are known as exploding gradients or vanishing gradients

Gradient Problems

The exploding gradient problem can be addressed with gradient clipping or carefully setting starting values using Glorot initialization.

Vanishing gradients are harder to detect and deal with!

  • Why do you think vanishing gradients are a problem?

  • Think about training a DNN with an SGD optimizer.

Gradient Problems

One reason we might see a vanishing gradient is the choice of activation function.

An old choice of activation function was the sigmoid:

\[\underset{(K_1 \times 1)}{\mathbf h_1} = \varphi(\mathbf q_1)\]

\[\varphi(\mathbf q_1) = \sigma(q_1) = \frac{1}{1 + \exp[-q_1]}\]

Gradient Problems

The sigmoid was used a lot because it has a simple continuous derivative:

\[\varphi'(q_1) = \sigma(q_1)(1 - \sigma(q_1))\]

However, this activation function saturates and leads to gradient problems.

Gradient Problems

At any layer, the gradients for the weights and the hidden units under the sigmoid activation function are:

\[\frac{\partial \mathcal L}{\partial \mathbf h_{d - 1}} = \mathbf u \odot \sigma(\mathbf q_d)(1 - \sigma(\mathbf q_d)) \otimes \mathbf W_d^T\]

\[\frac{\partial \mathcal L}{\partial \mathbf W_d^T} = (\mathbf u \odot \sigma(\mathbf q_d)(1 - \sigma(\mathbf q_d)))^T \, \mathbf h_{d-1}^T\]

If any \(q_d\) is sufficiently large (>3ish) or small (<-3ish), the derivative of the sigmoid activation function will essentially be zero!
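A quick numerical check of the saturation claim — the sigmoid derivative peaks at \(0.25\) (at \(q = 0\)) and is effectively zero once \(|q|\) passes about 5:

```python
import numpy as np

# Evaluate the sigmoid derivative sigma(q)(1 - sigma(q)) at a few points.
sigmoid = lambda q: 1 / (1 + np.exp(-q))
dsigmoid = lambda q: sigmoid(q) * (1 - sigmoid(q))

for q in [0.0, 1.0, 3.0, 5.0, 10.0]:
    print(q, dsigmoid(q))   # shrinks rapidly as |q| grows
```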

Gradient Problems

How can we fix this problem?

  • Prevent hidden values from getting too big or too small (normalization)

  • Modify the activation function so that it doesn’t saturate (rectifiers)

  • Choose good initial values (Glorot initialization)

  • Modify the DNN architecture such that the updates are additive instead of multiplicative (residual connections; 13.4.4 in PML1; More with CNNs)

Gradient Problems

Non-saturating activation functions:

An activation function where the derivative doesn’t saturate at zero over the real line.

The classic example is ReLU:

\[\varphi(q_1) = \max(0,q_1)\]

With the simple derivative of 0 if \(q\) is less than 0 and 1 if it’s greater.

Gradient Problems

A first thought:

ReLU saturates on the negative side!

You’re right! And this can be problematic. But, the benefits of ReLU outweigh the negatives.

Gradient Problems

The big benefit of these piecewise activation functions is that they induce sparsity in the hidden units

  • Each observation only has a few non-zero hidden values across the different hidden units

  • Allows each hidden unit to correspond to a sparse set of hidden features

  • Activation functions that are strictly nonzero everywhere don’t have this explicit property

Gradient Problems

While other approaches to solving the vanishing gradient problem are more popular, we could always come up with variants to the ReLU activation that don’t completely zero out units that are less than zero!

  • Advantage being that the gradient doesn’t shrink to zero!

The cost is the specificity of the hidden units to coherent subfunctions of the original feature space.

  • 0’s are nice! Think about the LASSO - it is a lot easier to interpret what coefficients are “important” using L1 penalties over L2 penalties.

Gradient Problems

Leaky ReLU

\[\varphi(\mathbf q_1) = \max(\alpha q_1, q_1)\]

where \(0 < \alpha < 1\)

Gradient Problems

ELU

\[\varphi(\mathbf q_1) = \begin{cases} q_1 & \text{if } q_1 > 0 \\ \alpha(e^{q_1} - 1) & \text{if } q_1 \leq 0 \end{cases}\]
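A minimal sketch comparing the three activations side by side; the \(\alpha = 0.1\) value is an illustrative choice, not a recommendation:

```python
import numpy as np

# ReLU, Leaky ReLU, and ELU evaluated on the same grid of inputs.
alpha = 0.1
relu = lambda q: np.maximum(0.0, q)
leaky_relu = lambda q: np.maximum(alpha * q, q)
elu = lambda q: np.where(q > 0, q, alpha * (np.exp(q) - 1))

q = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(relu(q))        # zeros out all negatives -> sparse, but "dead" gradient
print(leaky_relu(q))  # small negative slope keeps the gradient alive
print(elu(q))         # smooth saturation toward -alpha for large negative q
```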

Gradient Problems

Changing the activation function can help!

  • It just loses some of the clarity of the learned hidden features.

  • ReLU promotes sparsity which can lead to less complex models!

Gradient Problems

Without a rectifier, the hidden units don’t really correspond to meaningful parts of the image!

  • It’s just some weird mishmash of parts

This is one of the hallmarks of DNNs!

  • The deconstruction of instances into parts will yield big benefits when we look at images and texts that have hierarchical feature structures.

Gradient Problems

Even though ReLU is better than the sigmoid for the positive parts of the hidden units in terms of vanishing gradients, it still has problems when a lot of values are placed below zero

  • Large negative weights

  • Bad starting values

When the gradient at a specific hidden unit goes to zero for good (e.g. the weights move highly negative and lead everything to get mapped to zero by ReLU), this is called the dead ReLU problem

Gradient Problems

Batch Normalization

The problem of vanishing gradients occurs when most of the hidden values coming out of a node are mapped exactly to zero through the activation layer.

  • We don’t have any negative values for the hidden values!

This can be problematic: the NN gets trapped in suboptimal locations along the gradient descent path and cannot escape.

  • Dead ReLUs

Gradient Problems

If a hidden unit is actually supposed to be mostly zero, it will get there pretty quickly once we’re around the minimum loss coefficients

  • Get everything close to where it needs to be before we get to the minimum of the loss space and then adjust to zero when we get there

  • The advantage is that we don’t have situations where the gradient rushes to zero before we hit the minimum

This can be achieved with minimal additional computation and change to the training procedure through batch normalization of the hidden values!

Gradient Problems

Over a batch of \(B\) instances being considered at iteration \(t\) of the SGD optimizer, the goal is to ensure that each set of \(B\) hidden units - \(\mathbf q_d\), a \(B \times K_d\) matrix of values - has columns that have mean zero and variance 1.

Given a column of hidden values (post-activation):

\[\tilde{\mathbf q}_k = \gamma \odot \hat{\mathbf q}_k + \delta\]

\[\hat{\mathbf q}_k = \frac{\mathbf q_k - \mu_k}{\sqrt{\sigma^2_k + \epsilon}}\]

\[\mu_k = \frac{1}{B} \sum \limits_{j = 1}^{B} q_{j,k}\]

\[\sigma^2_k = \frac{1}{B} \sum \limits_{j = 1}^{B} (q_{j,k} - \mu_k)^2\]
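The standardization step above can be sketched in numpy for one batch; \(\gamma\) and \(\delta\) are learned scale and shift parameters, fixed here at 1 and 0 so the output is the pure standardization:

```python
import numpy as np

# Batch normalization of one batch of hidden values Q (B x K).
rng = np.random.default_rng(4)
B, K = 64, 8
Q = 5.0 + 2.0 * rng.normal(size=(B, K))   # hidden values far from mean 0
eps = 1e-5
gamma, delta = np.ones(K), np.zeros(K)    # learned scale/shift (fixed here)

mu = Q.mean(axis=0)                        # per-column batch mean
sigma2 = Q.var(axis=0)                     # per-column batch variance
Q_hat = (Q - mu) / np.sqrt(sigma2 + eps)
Q_tilde = gamma * Q_hat + delta

print(np.round(Q_tilde.mean(axis=0), 6))   # ~0 in every column
print(np.round(Q_tilde.std(axis=0), 3))    # ~1 in every column
```

In PyTorch, this is what `torch.nn.BatchNorm1d` computes during training; it also keeps running estimates of \(\mu\) and \(\sigma^2\) for use at prediction time.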

Gradient Problems

In words, we’re standardizing each hidden unit!

  • Think about standardization for regression. What effect does this have on the coefficients?

Gradient Problems

This prevents the weights from ever getting too big or too small.

And prevents large negatives or positives from occurring!

This prevents vanishing and exploding gradients!

Gradient Problems

The cost is that early stopping can lead to unclear activations

  • Less sparsity in the hidden units

However, when we converge to a minimum, we expect that the activations will creep very close to zero

This method can easily be implemented using PyTorch

Overfitting

Up until this point, we haven’t really discussed the problem of overfitting with DNNs

We know that NNs are universal approximators

  • Can achieve zero training error!

This is not desirable if our goal is to create a predictive model!

Overfitting

The longer we train the NN, the more we overfit to the training data

  • NNs are incredibly expressive!

We know that the “tuning parameters” for the NN are the number of hidden units in each layer

  • As the number increases, the more likely we are to overfit to the data

  • Unless we have insanely large sample sizes

Maybe we could try lots of different values for the number of units?

  • Shrug?

Overfitting

Training NNs is pretty quick.

But still computationally expensive.

  • We don’t often have the time or computational power to train hundreds of different NN specifications!

Overfitting

Additional problem: fast fitting

  • We move too quickly towards a minimum

  • Matters when better minima exist that require small moves to find them

  • Beeline vs. Slow Stroll

  • Really small learning rate?

Overfitting

Instead, different methods are used for a specific NN specification

  • Train a 512 hidden unit NN

  • Use different regularization methods to try to prevent overfitting to the training data

Overfitting

Method 1: Smaller Batch Sizes

  • Let stochastic gradient descent do the work

  • Can y’all come up with some reasons why this would work?

  • Can y’all come up with some reasons why this isn’t always advisable?

Overfitting

Method 2: Early stopping criteria related to a validation set

  • Split the original training data into training and validation sets

  • Monitor the loss w.r.t. the training and validation data at the end of each epoch

  • Stop when the validation loss increases!

A common sense approach

  • We know that the fit to the validation data will improve at the beginning

  • At a certain point, we learn the underlying functional form as well as we can

  • Every “improvement” after that is just responding to noise!
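The stopping rule above can be sketched with made-up loss curves (the numbers are purely illustrative, and real implementations usually add a "patience" window rather than stopping at the first uptick):

```python
# Simulated per-epoch losses: training loss keeps falling, while
# validation loss bottoms out and then turns back up.
train_loss = [1.00, 0.70, 0.50, 0.38, 0.30, 0.25, 0.21, 0.18, 0.16, 0.14]
val_loss   = [1.05, 0.80, 0.62, 0.52, 0.47, 0.45, 0.44, 0.46, 0.49, 0.53]

best_epoch = 0
for epoch in range(1, len(val_loss)):
    if val_loss[epoch] > val_loss[epoch - 1]:   # validation loss increased
        break                                    # stop training here
    best_epoch = epoch

print(best_epoch)  # stops at epoch 6, where validation loss bottoms out
```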

Overfitting

This approach should be used, but comes with a few cautions.

  • Any thoughts?

Overfitting

Early stopping using a split of training and validation data is a must for fitting DNNs!

  • It’s one sure-fire way that works in any scenario to try to stop the model from reaching the overfitting zone.

Early stopping is just a heuristic method, though.

  • Though it can be thought of as a method of controlling the effective number of parameters

Maybe we can use methods to try to explicitly do this.

Overfitting

Consider a two-layer neural network with 512 hidden units in each layer and 10 categories.

Let’s examine the number of coefficients:

  • First layer: \(512 \times P\)

  • Second layer: \(512 \times 512\)

  • Loss layer: \(512 \times 10\)

That is 668,672 coefficients!

Even for the MNIST data with 20,000 training observations, that’s too many parameters!
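The count is easy to reproduce (using \(P = 784\) input pixels for MNIST and ignoring bias terms, as the slide does):

```python
# Reproduce the coefficient count for the two-layer, 512-unit network.
P, K, C = 784, 512, 10
first = K * P        # first layer weights
second = K * K       # second layer weights
loss_layer = K * C   # loss layer weights
total = first + second + loss_layer
print(total)  # 668672
```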

Overfitting

Most of these parameters can’t actually do anything to help the model!

  • \(P \gg N\)

The model is overparameterized

  • However, the model seems to do okay even with this incredibly overparameterized space

  • This is a function of SGD and its inherent regularization properties!

Overfitting

Rather than fitting an overidentified model and stopping it early, we can also try to regularize the model space to encourage most weights to be close to zero.

  • The same as Ridge and LASSO!

For a deep neural network with \(D\) layers, we have \(D + 1\) sets of weights on the units:

\[\bb, \mathbf W_D, \mathbf W_{D - 1},...,\mathbf W_{1}\]

For each set of weights, define the squared Frobenius norm of the weight matrix as:

\[\|\mathbf W_d\|^2_F = \sum \limits_{i = 1}^{K_{d-1}} \sum \limits_{j = 1}^{K_{d}} w_{ij}^2\]

Overfitting

Fitting a NN with weight decay amounts to fitting a NN with L2 regularization of the weights:

\[ \frac{1}{N} \sum \limits_{i = 1}^N \mathcal L(\boldsymbol \theta | \mathbf x_i , y_i) + \lambda \sum \limits_{d = 1}^D \|\mathbf W_d\|^2_F\]

  • Each set of weights is penalized in the original loss specification

  • Like setting a Ridge penalty for the entire neural network!
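The penalty’s contribution to each weight gradient is just \(2\lambda \mathbf W_d\), since the derivative of \(\lambda \|\mathbf W_d\|_F^2\) with respect to \(\mathbf W_d\) is \(2\lambda \mathbf W_d\) — easy to verify by finite differences:

```python
import numpy as np

# Finite-difference check on the weight-decay penalty gradient.
rng = np.random.default_rng(5)
lam = 0.1
W = rng.normal(size=(4, 3))

penalty = lambda W_: lam * np.sum(W_ ** 2)   # lambda * squared Frobenius norm
grad = 2 * lam * W                           # analytic gradient

eps = 1e-6
Wp = W.copy()
Wp[1, 2] += eps
num = (penalty(Wp) - penalty(W)) / eps
print(np.isclose(num, grad[1, 2], atol=1e-4))  # True
```

In PyTorch this is usually done by passing `weight_decay` to the optimizer (e.g. `torch.optim.SGD`), which folds a \(\lambda\)-scaled copy of the weights into each gradient update (the factor of 2 gets absorbed into the coefficient).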

Overfitting

Will still eventually get to zero training error (plus a constant from the penalty term)

  • Will get there slower though

Helps to slow the process down to hopefully find better generalizable solutions!

Overfitting

Weight decay helps to close the gap between training and generalization error

  • Useful when the model is heavily overparameterized to give better generalization performance

Tuning \(\lambda\) is more art than science for NNs

  • Just try .01, .1, and 1. Pick the one that does the best

\(\lambda\) sorta controls the number of hidden units!

  • We’re shrinking most weights to 0

  • Equivalent to choosing the number of hidden units without needing to retrain a bunch

Overfitting

We could also induce a L1 penalty on the weights

  • A NN with LASSO

This isn’t popular because SGD doesn’t play very well with L1 penalties

A more popular method of inducing sparse weight sets is to employ dropout

Overfitting

Dropout

With probability \(p\), set the weight between node \(i\) and node \(j\) to zero. Do this randomly for each instance in each iteration.

  • Seems silly, but it works quite well!

Overfitting

Dropout works well because it forces the NN to learn robust dependencies among the hidden values

  • If the connection isn’t always there, then the other connections need to make up for the missingness

  • Akin to an ensemble of networks that will do well on small subsets of the data

  • Think a random forest for NNs!

Overfitting

\[ \begin{equation*} \mathbf{W} = \begin{bmatrix} w_{11} & w_{12} & w_{13} & w_{14} & w_{15} & \ldots \\ w_{21} & w_{22} & w_{23} & w_{24} & w_{25} & \ldots \\ w_{31} & w_{32} & w_{33} & w_{34} & w_{35} & \ldots \\ w_{41} & w_{42} & w_{43} & w_{44} & w_{45} & \ldots\\ w_{51} & w_{52} & w_{53} & w_{54} & w_{55} & \ldots\\ \vdots & \vdots & \vdots & \vdots & \vdots & \ddots\\ \end{bmatrix} \end{equation*} \]

Overfitting

\[ \begin{equation*} \mathbf{W} = \begin{bmatrix} 0 & w_{12} & w_{13} & 0 & w_{15} & \ldots \\ w_{21} & 0 & w_{23} & w_{24} & 0 & \ldots \\ w_{31} & w_{32} & 0 & w_{34} & w_{35} & \ldots \\ 0 & w_{42} & w_{43} & 0 & w_{45} & \ldots\\ w_{51} & 0 & w_{53} & w_{54} & 0 & \ldots\\ \vdots & \vdots & \vdots & \vdots & \vdots & \ddots\\ \end{bmatrix} \end{equation*} \]

Overfitting

Dropout is sorta like fitting a noisy version of the NN weights

With probability \(p\) of dropout:

\[\theta_{dij} = w_{dij} \epsilon_{di}\]

where \(\epsilon_{di} \sim \text{Bernoulli}(1 - p)\), so \(E[\epsilon_{di}] = 1 - p\).

After training a dropout network, we multiply the final weights by \(1 - p\) to get an approximation of the true weights for the network to use for predictions!
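A numpy sketch of the weight-dropout scheme described above, plus the prediction-time rescaling. Note that common library implementations, like PyTorch’s `torch.nn.Dropout`, drop activations rather than individual weights and use "inverted" dropout — scaling kept values by \(1/(1-p)\) during training so no rescaling is needed at prediction time:

```python
import numpy as np

# Dropout on a weight matrix: each weight is zeroed independently with
# probability p during training; after training, weights are scaled by 1 - p.
rng = np.random.default_rng(6)
p = 0.25
W = rng.normal(size=(5, 5))

mask = rng.random(size=W.shape) >= p   # True = keep, with probability 1 - p
W_train = W * mask                     # dropped weights are exactly zero
W_predict = W * (1 - p)                # post-training scaling from the slides

print(np.mean(mask))                   # roughly 1 - p of the weights survive
```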

Overfitting

Though the “default” dropout is .5, I have found that setting the level to .2 or .25 seems to work really well for a lot of problems

  • You may need to tune this a little

My workflow (that doesn’t work for every problem):

  • Early stopping

  • Dropout if it’s a really picky network

Computationally quick and leads to good generalizable NNs

Overfitting

This is just an overview of generalization methods for NNs

  • A lot of other stuff out there!

DL and PML1/PML2 have a lot of other methods for making NNs generalize better!

Neural Networks

This has been a pretty thorough overview of NN architecture

  • Only applied to tabular data at this point

Next class, we’ll extend our toolkit to include images and methods that respect pixel location

  • Convolutional neural networks!